Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ARC ghost states eviction accounting. #12279

Merged
merged 1 commit into from
Jul 13, 2021

Conversation

amotin
Copy link
Member

@amotin amotin commented Jun 24, 2021

arc_evict_hdr() returns number of evicted bytes in scope of specific
state. For ghost states it does not mean the amount of really freed
memory, but the logical buffer size. It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead. Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations. We can also do it without taking global
arc_evict_lock reducing the contention.

How Has This Been Tested?

On 40-thread FreeBSD system with 128GB of ARC and heavy uncached 4KB ZVOLs reads profiler shows reduction of CPU time spent in arc_wait_for eviction() from 1.6% to 0.2%. The fio benchmark reports the same 336K IOPS as before, just with expected higher latency above 99.9% percentile:

     |  1.00th=[   12],  5.00th=[   16], 10.00th=[   23], 20.00th=[   49],
     | 30.00th=[   54], 40.00th=[   58], 50.00th=[   62], 60.00th=[   67],
     | 70.00th=[   72], 80.00th=[   78], 90.00th=[   91], 95.00th=[  118],
     | 99.00th=[  237], 99.50th=[  379], 99.90th=[ 2073], 99.95th=[ 2638],
     | 99.99th=[ 4424]

instead of previous:

     |  1.00th=[   12],  5.00th=[   16], 10.00th=[   26], 20.00th=[   49],
     | 30.00th=[   55], 40.00th=[   59], 50.00th=[   63], 60.00th=[   68],
     | 70.00th=[   73], 80.00th=[   79], 90.00th=[   92], 95.00th=[  147],
     | 99.00th=[  338], 99.50th=[  383], 99.90th=[  478], 99.95th=[  537],
     | 99.99th=[  906]

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jun 24, 2021
@amotin amotin requested a review from ahrens June 24, 2021 14:50
Copy link
Contributor

@allanjude allanjude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed-by: Allan Jude allanjude@freebsd.org

@amotin amotin force-pushed the wait_for_eviction branch from b1fdbe9 to 1fd6f9b Compare June 24, 2021 16:18
@ahrens ahrens requested a review from grwilson June 28, 2021 16:10
module/zfs/arc.c Show resolved Hide resolved
module/zfs/arc.c Outdated Show resolved Hide resolved
zthr_wakeup(arc_evict_zthr);
}
return;
default:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be safer (to future code changes) to have this be case 2: and have the default panic. That ensures that we handle all of the possible return values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would add another branching and the panic code itself, while in this case it is really safe (just slower) to take the default path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the current code is correct. My concern is that if someone adds a new variant of the enum that has some different meaning, we want to make sure that they think about how that should be handled here, instead of just treating it the same as ARC_OVF_SEVERE. In other words, it's generally good practice for switch statements to handle all the variants and panic if an unknown variant is encountered - it's defensive against future changes. But I see that there are plenty of places in the existing code that don't do that, so oh well.

Copy link
Member Author

@amotin amotin Jun 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On FreeBSD there is __assert_unreachable() macro, turning into panic() when built with debug and into __builtin_unreachable() otherwise if compiler supports it and into nothing if not. We could introduce something like that in ZFS, if somebody knows how to properly spell it on Linux.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the argument against default: panic("invalid arc_ovf_level_t")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a waste of time for additional branching and a code trashing.

Copy link
Member

@ahrens ahrens Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. I think any performance cost is trivial (but will happily change my mind if you have performance measurements to the contrary). In my opinion, indicating what states are expected and allowed makes the code more clear.

That said, while this is an interesting discussion of philosophy, I don't think it's hugely consequential to this PR - as I mentioned there are other places that have similarly problematic use of default. So I wouldn't hold up your PR over this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets say I saw in some profiles before effects of alike additional cases. Depending on situation compiler may turn switch into number of if's, and additional branching in a hot path never helps. In this particular case I believe compiler will inline arc_is_overflowing(), since it is now static and used in only one place, and just throw out any additional code we'd add there.

I've described above how I see it to be done properly, both paranoid in debug builds, may be even more optimized in production and clearly visible in the code.

case 1:
if (!arc_evict_needed) {
arc_evict_needed = B_TRUE;
zthr_wakeup(arc_evict_zthr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the arc_evict_lock to set arc_evict_needed? could we also change it without the lock in arc_reduce_target_size() and arc_evict_cb()?

Copy link
Member Author

@amotin amotin Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit racy, but since we are not in overflow, it does not matter. arc_reduce_target_size() in addition to arc_evict_needed also changes the arc_c, potentially significantly, so additional fence makes sense to me. In arc_evict_cb() the locking is needed to reliably handle arc_evict_waiters list, but since it is called from only one thread, it does not matter much to performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this makes sense to me. Are you saying we don't need the arc_evict_lock here because we're not actually going to sleep which makes the comment surrounding arc_evict_needed/arc_wait_for_eviction dance meaningless in this case? Nonetheless, we need to either document this significantly or change the consumers of arc_evict_needed to be consistent. My fear is that another developer would see the lack of lock here and then deem this a bug or want to remove the lock in other places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment here. Comments around arc_evict_needed/arc_wait_for_eviction are about arc_evict_waiters, not touched here.

@amotin amotin force-pushed the wait_for_eviction branch 4 times, most recently from 58a9d5b to e7f33d6 Compare July 1, 2021 15:55
case 1:
if (!arc_evict_needed) {
arc_evict_needed = B_TRUE;
zthr_wakeup(arc_evict_zthr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this makes sense to me. Are you saying we don't need the arc_evict_lock here because we're not actually going to sleep which makes the comment surrounding arc_evict_needed/arc_wait_for_eviction dance meaningless in this case? Nonetheless, we need to either document this significantly or change the consumers of arc_evict_needed to be consistent. My fear is that another developer would see the lack of lock here and then deem this a bug or want to remove the lock in other places.

module/zfs/arc.c Show resolved Hide resolved
module/zfs/arc.c Show resolved Hide resolved
@amotin amotin force-pushed the wait_for_eviction branch 2 times, most recently from 9a1f261 to b1eea76 Compare July 8, 2021 16:50
module/zfs/arc.c Outdated Show resolved Hide resolved
module/zfs/arc.c Show resolved Hide resolved
@amotin amotin force-pushed the wait_for_eviction branch 3 times, most recently from fcb8cc6 to 8aedcfd Compare July 8, 2021 21:24
@amotin amotin force-pushed the wait_for_eviction branch 2 times, most recently from 9f21c96 to 9b92aeb Compare July 10, 2021 14:43
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
@amotin amotin force-pushed the wait_for_eviction branch from 9b92aeb to 8c49357 Compare July 11, 2021 15:14
@mmaybee mmaybee added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jul 13, 2021
@mmaybee mmaybee merged commit f7de776 into openzfs:master Jul 13, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 23, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12279
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12279
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12279
@amotin amotin deleted the wait_for_eviction branch August 24, 2021 20:17
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 24, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12279
behlendorf pushed a commit that referenced this pull request Aug 31, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #12279
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 15, 2021
arc_evict_hdr() returns number of evicted bytes in scope of specific
state.  For ghost states it does not mean the amount of really freed
memory, but the logical buffer size.  It is correct for the eviction
process, but not for waking up threads waiting for ARC size reduction,
as added in "Revise ARC shrinker algorithm" commit, causing premature
wakeups while ARC is still overflowed, allowing even bigger overflow,
plus processing overhead when next allocation will also get blocked,
probably also for too short time.

To fix that make arc_evict_hdr() also return the amount of really
freed memory, which for the ghost states is only the header, and use
it to update arc_evict_count instead.  Originally I was thinking to
not return it at all, since arc_get_data_impl() does not account for
the headers, but decided that some slow allocation progress is better
than long waits, reaching on my tests up to 100ms.

To reduce negative latency effects of long time periods when reclaim
thread can free little real memory, start reclamation process earlier,
before we actually reached the overflow threshold, when we have to
throttle new allocations.  We can also do it without taking global
arc_evict_lock, reducing the contention.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#12279
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants